We will be using the taxlot_sample data for this lab. Download taxlot_sample.csv
It’s generally a good idea to start your modeling work, including diagnostics, with descriptive statistics and visualization:
require(tidyverse)
taxlot_sample <- read_csv("data/taxlot_sample.csv")
set.seed(100)
taxlot_sfr <- taxlot_sample %>%
filter(sfr == 1)
require(ggplot2)
# histograms for outcome and predictor variables
ggplot(data=taxlot_sfr, aes(x=TOTALVAL)) +
geom_histogram()
ggplot(data=taxlot_sfr, aes(x=BLDGSQFT)) +
geom_histogram()
# boxplots for outcome and predictor variables
ggplot(data=taxlot_sfr, aes(x="TOTALVAL", y=TOTALVAL)) +
geom_boxplot()
ggplot(data=taxlot_sfr, aes(x="BLDGSQFT", y=BLDGSQFT)) +
geom_boxplot()
# scatter plot between outcome and predictor variable
ggplot(data=taxlot_sfr, aes(x=BLDGSQFT, y=TOTALVAL)) +
geom_point() + geom_smooth(method="lm", se=FALSE)
# Which observations do you suspect are outliers/influential observations?
In many cases, it is useful to find out which observations they are and what are the values of other variables for these observations. For these, interactive plots would be handy:
if (!require(plotly))
install.packages("plotly") & require(plotly)
g <- ggplot(data=taxlot_sfr, aes(x=BLDGSQFT, y=TOTALVAL, text=FID)) +
geom_point() + geom_smooth(method="lm", se=FALSE)
ggplotly(g, tooltip=c("FID"))
# Now to identify the observation in the top-right corner, move your mouse to hover above the data point,
# and a tooltip will show its FID (the unique identifier) value: 1098, and we can get the row representing
# the observation:
taxlot_sfr %>% filter(FID==1098)
## # A tibble: 1 x 17
## FID TOTALVAL BLDGSQFT YEARBUILT GIS_ACRES condo apt sfr com dpioneer dfwy dpark dmax nodedens dbikehq slope10 dbusy
## <int> <int> <int> <int> <dbl> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1098 1737570 5769 1920 0.4204119 0 0 1 0 1.459124 0.7571349 0.05611497 0.9215101 240 0.295714 1 0.1475817
#We can see it is probably a large house on a less than half acre lot, built in 1920 at a very good location (1.5 miles from Poineer square; 0.06 miles from a park, etc)
# Now to identify the observation in the top-right corner, move your mouse to hover above the data point,
# and a tooltip will show its FID (the unique identifier) value: 1098, and we can get the row representing
# the observation:
taxlot_sfr %>% filter(FID==1098)
## # A tibble: 1 x 17
## FID TOTALVAL BLDGSQFT YEARBUILT GIS_ACRES condo apt sfr com dpioneer dfwy dpark dmax nodedens dbikehq slope10 dbusy
## <int> <int> <int> <int> <dbl> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1098 1737570 5769 1920 0.4204119 0 0 1 0 1.459124 0.7571349 0.05611497 0.9215101 240 0.295714 1 0.1475817
#We can see it is probably a large house on a less than half acre lot, built in 1920 at a very good location (1.5 miles from Poineer square; 0.06 miles from a park, etc)
(Adapted from Joseph Broach’s 2016 course material)
Expected: random cloud of points and a roughly horizontal (Red) indicator of conditional residual means given fitted values
Departures: residuals that drift up or down considerably as Y increases (wrong explanatory variables, nonlinearity, non-fixed parameters at population level)
Expected: points more or less fall on the 45-degree line, matching expectations. There will be some departure in real world data, especially at the tails.
Departures: residuals that seem disinterested in following the line, or that snake noticeably from side to side (non-normality, could be omitted variable or nonlinear relationship)
Expected: More or less horizontal fit curve with no strong evidence that expected error magnitude changes over the range of Y.
Departures: residuals that trend up or down considerably as Y increases (potential heteroskedasticity or autocorrelation in errors)
Expected: red fit curve close to horizontal and no points outside the 0.5 Cook’s distance bands
Departures: specific influential observations (bad data or rare data that may be difficult to infer about from sample); this is less a specific departure from assumptions and more a warning that some sample points are “exploiting” our OLS estimation technique and possibly given too much weight as a result.
lm_sfr <- lm(TOTALVAL ~ YEARBUILT + BLDGSQFT + GIS_ACRES + dpioneer + dmax, data=taxlot_sfr)
summary(lm_sfr)
##
## Call:
## lm(formula = TOTALVAL ~ YEARBUILT + BLDGSQFT + GIS_ACRES + dpioneer +
## dmax, data = taxlot_sfr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -549809 -40008 -3202 36715 626089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1035340.37 197960.55 5.230 2.07e-07 ***
## YEARBUILT -403.89 104.40 -3.869 0.000117 ***
## BLDGSQFT 135.61 3.89 34.856 < 2e-16 ***
## GIS_ACRES 295283.37 30999.29 9.525 < 2e-16 ***
## dpioneer -39582.98 1686.66 -23.468 < 2e-16 ***
## dmax 3145.61 3317.25 0.948 0.343228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86000 on 994 degrees of freedom
## Multiple R-squared: 0.7514, Adjusted R-squared: 0.7502
## F-statistic: 601 on 5 and 994 DF, p-value: < 2.2e-16
plot(lm_sfr, ask=FALSE)
olsrr packageif (!require(olsrr))
install.packages("olsrr") & require(olsrr)
# leverage (hat)
leverage <- ols_leverage(lm_sfr)
ols_rsdlev_plot(lm_sfr)
# Cook's distance
ols_cooksd_chart(lm_sfr)
# DFFITS
ols_dffits_plot(lm_sfr)
# DFBETAS
ols_dfbetas_panel(lm_sfr)
It is a scatter plot of residuals on the y axis and fitted values on the x axis to detect non-linearity, unequal error variances, and outliers.
ols_rvsp_plot(lm_sfr)
ols_rsd_qqplot(lm_sfr)
# hypothesis test of normality of residuals
ols_norm_test(lm_sfr)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.8991 0.0000
## Kolmogorov-Smirnov 0.0897 0.0000
## Cramer-von Mises 84.0093 0.0038
## Anderson-Darling 16.9564 0.0000
## -----------------------------------------------
ols_bp_test(lm_sfr)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ------------------------------------
## Response : TOTALVAL
## Variables: fitted values of TOTALVAL
##
## Test Summary
## --------------------------------
## DF = 1
## Chi2 = 861.8452
## Prob > Chi2 = 1.933814e-189
# standard variance-covariance matrix
vcov0 <- vcov(lm_sfr)
# convert to correlation
vcov0
## (Intercept) YEARBUILT BLDGSQFT GIS_ACRES dpioneer dmax
## (Intercept) 39188379730.4 -2.064065e+07 133496.86296 -196310620.16 137427155.953 1.309633e+08
## YEARBUILT -20640652.1 1.090022e+04 -83.49863 103217.56 -77830.449 -7.181746e+04
## BLDGSQFT 133496.9 -8.349863e+01 15.13591 -44496.71 2074.536 -3.194746e+02
## GIS_ACRES -196310620.2 1.032176e+05 -44496.71348 960956091.44 -12825833.371 -1.305844e+07
## dpioneer 137427156.0 -7.783045e+04 2074.53629 -12825833.37 2844822.573 -5.215829e+05
## dmax 130963340.4 -7.181746e+04 -319.47462 -13058438.95 -521582.919 1.100413e+07
# Heteroskedasticity-Consistent variance covariance matrix
require(car)
vcov_hc3 <- hccm(lm_sfr, type="hc3")
vcov_hc3
## (Intercept) YEARBUILT BLDGSQFT GIS_ACRES dpioneer dmax
## (Intercept) 7.048578e+10 -3.709466e+07 -30315.91239 1131011898.5 269284975.137 239474701.266
## YEARBUILT -3.709466e+07 1.957207e+04 -25.18230 -696437.1 -146022.539 -122310.287
## BLDGSQFT -3.031591e+04 -2.518230e+01 61.75066 -124379.3 2257.507 -8035.445
## GIS_ACRES 1.131012e+09 -6.964371e+05 -124379.27896 5143372479.3 -65293301.459 -11178992.757
## dpioneer 2.692850e+08 -1.460225e+05 2257.50736 -65293301.5 4141156.696 616312.300
## dmax 2.394747e+08 -1.223103e+05 -8035.44545 -11178992.8 616312.300 7984659.477
# In presence of Heteroskedasticity, vcov_hc3 is larger than vcov0, to redo hypothesis tests
# with the Heteroskedasticity-Consistent variance covariance matrix
if (!require(lmtest))
install.packages("lmtest") & library(lmtest)
coeftest(lm_sfr, vcov_hc3)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0353e+06 2.6549e+05 3.8997 0.0001028 ***
## YEARBUILT -4.0389e+02 1.3990e+02 -2.8870 0.0039738 **
## BLDGSQFT 1.3561e+02 7.8582e+00 17.2570 < 2.2e-16 ***
## GIS_ACRES 2.9528e+05 7.1717e+04 4.1173 4.151e-05 ***
## dpioneer -3.9583e+04 2.0350e+03 -19.4513 < 2.2e-16 ***
## dmax 3.1456e+03 2.8257e+03 1.1132 0.2658876
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variance inflation factors measure the inflation in the variances of the parameter estimates due to collinearities that exist among the predictors. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
ols_vif_tol(lm_sfr)
## # A tibble: 5 x 3
## Variables Tolerance VIF
## <chr> <dbl> <dbl>
## 1 YEARBUILT 0.7155483 1.397530
## 2 BLDGSQFT 0.7939384 1.259544
## 3 GIS_ACRES 0.7985859 1.252213
## 4 dpioneer 0.6864766 1.456714
## 5 dmax 0.8759423 1.141628